An intro to species distribution modeling for an iconic California woodland species.
In this post I’ll walk through an introduction to species distribution modeling (SDM) using the dismo package. These are machine-learning (algorithmic) approaches. We’ll go through the steps of acquiring observation data, generating absence points, gathering the underlying climate data, and making predictions.
Here, we’ll use an SDM to make inferences about the distribution of suitable habitat for a species of interest. There are many SDM approaches out there, each with its own pros and cons. Check out this table comparing different SDM options. Further, while SDMs have many limitations (list articles here), their ability to yield ecological insight and predict species distributions across a landscape is valuable.
For this exercise we’ll use the blue oak (Quercus douglasii) tree. Blue oaks are native and endemic to California. They occur within a narrow range along the valleys and slopes of the mountain ranges that surround California’s Central Valley.
If you are having problems installing the rJava package like so:

Here are some troubleshooting options for installation on macOS Monterey with the M1 architecture (ymmv).
Here are the steps I took to get rJava working:
1. Remove rJava with `remove.packages()`.
2. Run `R CMD javareconf` in the terminal. This reconfigures the `JAVA_HOME` directories.
3. Reinstall rJava in RStudio, and voilà.

We’ll set up our file structure here.
Let’s download our species data. We’ll limit it to the Global Biodiversity Information Facility (GBIF) data repository using the spocc::occ function. Other data repositories are available (iNaturalist, eBird, BISON, iDigBio), but a lot of that data eventually ends up in GBIF. We’ll filter our results to California-only observations (the native range) and remove “PRESERVED_SPECIMEN” records, which are herbaria records. We then take the obs_csv dataframe and convert it to a spatial object using our coordinate data.
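The download-filter-convert step might look something like the sketch below. The record limit, filtering columns, and object names (obs, obs_csv, obs_sf) are illustrative assumptions, not the post’s exact code:

```r
# A minimal sketch of the download step; object names and the
# record limit are illustrative.
library(spocc)
library(dplyr)
library(sf)

# Query GBIF for blue oak occurrences that have coordinates
obs <- spocc::occ(
  query = "Quercus douglasii",
  from = "gbif",
  limit = 10000,
  has_coords = TRUE
)

# Pull out the GBIF records, then keep California observations
# and drop herbaria records
obs_csv <- obs$gbif$data[[1]] %>%
  dplyr::filter(
    stateProvince == "California",
    basisOfRecord != "PRESERVED_SPECIMEN"
  )

# Convert the data frame to a spatial (sf) object
obs_sf <- sf::st_as_sf(
  obs_csv,
  coords = c("longitude", "latitude"),
  crs = 4326  # WGS84
)
```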
The function file.exists() returns a logical vector. I’ll use that pattern to avoid re-downloading/re-creating files each time the document is compiled.
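The caching pattern looks roughly like this (the file path and object names here are hypothetical):

```r
# Illustrative caching pattern: the expensive step only runs
# when the file is not already on disk.
obs_path <- "data/occurrences.csv"
if (!file.exists(obs_path)) {
  # ...expensive download / processing happens here...
  write.csv(obs_csv, obs_path, row.names = FALSE)
}
obs_csv <- read.csv(obs_path)
```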
We’ll now get the underlying climate data for our SDM.
We’ll use two terrestrial datasets, “WorldClim” and “ENVIREM”.
There are 86 layers to choose from. Not all of them are needed so let’s select a few layers that are most relevant. We’ll use:
| Layer code | Name | Description |
|---|---|---|
| WC_alt | Altitude | Altitude |
| WC_bio1 | Annual mean temp | Annual mean temp |
| WC_bio2 | Mean diurnal temp range | Mean of the monthly (max temp - min temp) |
| WC_bio6 | Minimum temp | Minimum temp of the coldest month |
| WC_bio12 | Annual precipitation | Annual precipitation |
| ER_tri | Terrain roughness index | Terrain roughness index |
| ER_topoWet | Topographic wetness | SAGA-GIS topographic wetness index |

These raster layers are currently global. Let’s crop the environmental rasters to a reasonable study area. There are two good options to clip our rasters:

1. A convex hull around our observation points with the sf package (sf::st_convex_hull(st_union()))
2. State boundaries from the USAboundaries package

Now let’s clip our environmental layers to this bounding box.
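The convex-hull option can be sketched as below. It assumes env_stack is the RasterStack of selected layers and obs_sf holds the observation points (both names illustrative); the buffer distance is an arbitrary choice:

```r
# Sketch of the clipping step, assuming env_stack (RasterStack)
# and obs_sf (sf points) exist; names are illustrative.
library(raster)
library(sf)

# Convex hull around the observations, with a small cushion
hull <- sf::st_buffer(
  sf::st_convex_hull(sf::st_union(obs_sf)),
  dist = 0.5  # degrees; arbitrary buffer around the points
)
hull_sp <- as(hull, "Spatial")

# Crop to the hull's bounding box, then mask cells outside the hull
env_crop <- raster::crop(env_stack, raster::extent(hull_sp))
env_crop <- raster::mask(env_crop, hull_sp)
```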

The table above is the **data** that feeds into our SDM (y ~ X), where y is the present column, with values of 1 (present) or 0 (absent), and X is all other columns: WC_alt, WC_bio1, WC_bio2, WC_bio6, WC_bio12, ER_tri, ER_topoWet, lon, lat.
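Assembling that table might look like the following sketch, with presence points plus randomly generated background (pseudo-absence) points. The object names (env_crop, obs_sf, sdm_df) and the number of background points are assumptions:

```r
# Sketch of assembling the y ~ X table; names illustrative.
library(dismo)
library(raster)
library(sf)

# Random background (pseudo-absence) points from the study area
abs_pts <- dismo::randomPoints(env_crop, n = 1000)
colnames(abs_pts) <- c("lon", "lat")

pres_xy <- sf::st_coordinates(obs_sf)
colnames(pres_xy) <- c("lon", "lat")

# Extract predictor values at each point and bind into one table
sdm_df <- rbind(
  data.frame(present = 1, raster::extract(env_crop, pres_xy), pres_xy),
  data.frame(present = 0, raster::extract(env_crop, abs_pts), abs_pts)
)
```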
Maxent is probably the most commonly used species distribution model (Elith 2011), since it performs well with few input data points, requires only presence points (sampling background locations for comparison), and is easy to use via a Java graphical user interface (GUI).
MaxEnt requires presence-only observation points.
This is MaxEnt version 3.4.3
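Fitting the model with dismo might look like this minimal sketch. It assumes a working rJava setup, the MaxEnt jar in dismo’s java folder, and the env_crop / sdm_df objects from earlier steps (all names illustrative):

```r
# Minimal sketch of a MaxEnt fit with dismo; requires the MaxEnt
# jar in dismo's java folder and a working rJava install.
library(dismo)

me <- dismo::maxent(
  x = env_crop,                                      # predictor stack
  p = sdm_df[sdm_df$present == 1, c("lon", "lat")],  # presence coords
  path = "output/maxent"                             # report directory
)
```

MaxEnt also writes an HTML report (variable contributions, response curves, AUC) to the output directory.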

In the Variable Contribution plot we check the percentage each environmental predictor contributes to our model. We see here that the top three predictors, WC_bio12, WC_bio6, and WC_bio1, contribute approximately 90%.
Response curves show how each environmental variable affects the MaxEnt prediction, i.e., how the prediction changes as each variable is varied. The default in the dismo::response() function is to vary one variable while holding all other environmental variables at their mean. In other words, the curves show the marginal effect of changing exactly one variable.
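Generating the curves is a one-liner, assuming a fitted MaxEnt model object (here called me):

```r
# Sketch: marginal response curves for a fitted MaxEnt model `me`
# (one panel per predictor; all other variables held at their mean)
library(dismo)
dismo::response(me)
```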

Let’s make raw predictions of suitable habitat for our species.
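Projecting the fitted model across the predictor stack might look like this (me and env_crop are assumed names from earlier steps):

```r
# Sketch: predict suitability across the cropped predictor stack
# and plot the resulting surface.
library(dismo)
library(raster)

pred_raw <- predict(me, env_crop)
plot(pred_raw, main = "MaxEnt suitability (uncalibrated)")
```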

While these raw results are a good start, it’s important to calibrate your model. In particular, it is critical to check for multicollinearity between the independent variables. When multiple independent variables (in our case, our X values) are highly intercorrelated, the model can produce skewed or misleading results; in effect, this widens your confidence intervals and makes the results less reliable.
We first use a pairs plot of pairwise correlations to flag pairs or groups of variables that are highly correlated. You can see below that we have some strongly correlated variables. For our model, we’ll set our pairwise correlation threshold at 0.7.
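The raster package supplies a pairs() method for stacks, so the plot can be produced directly from the cropped predictors (env_crop assumed):

```r
# Sketch: pairwise scatterplots and correlation coefficients
# across the predictor layers.
library(raster)
pairs(env_crop)
```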
We can detect multicollinearity (strong correlation between two or more independent variables) with a Variance Inflation Factor (VIF) using the usdm::vif() function. A VIF greater than 10 indicates that our model has a collinearity problem. We can reduce collinearity by excluding the flagged variables, e.g., with usdm::vifcor(), which drops variables until no pair exceeds the correlation threshold.
We see that ER_topoWet, WC_bio6, and WC_bio1 are flagged as collinear. Let’s remove those from our raster stack.
3 variables from the 7 input variables have collinearity problem:
ER_topoWet WC_bio6 WC_bio1
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( WC_bio2 ~ WC_alt ): -0.1005908
max correlation ( ER_tri ~ WC_alt ): 0.6104043
---------- VIFs of the remained variables --------
Variables VIF
1 WC_alt 1.639451
2 WC_bio2 1.156533
3 WC_bio12 1.474288
4 ER_tri 2.169617

Let’s re-compute our model with the reduced set of predictors.
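Dropping the flagged layers and refitting might be sketched as follows (env_crop and sdm_df are assumed names from the earlier steps; the dropped layers match the usdm output above):

```r
# Sketch: remove the collinear layers, refit MaxEnt, and re-predict
library(dismo)
library(raster)

env_reduced <- raster::dropLayer(
  env_crop, c("ER_topoWet", "WC_bio6", "WC_bio1")
)

me2 <- dismo::maxent(
  x = env_reduced,
  p = sdm_df[sdm_df$present == 1, c("lon", "lat")]
)

pred_cal <- predict(me2, env_reduced)
plot(pred_cal, main = "MaxEnt suitability (reduced predictors)")
```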


